This report provides an evaluation of the accuracy and precision of probabilistic forecasts submitted to the COVID-19 Forecast Hub over the last 10 weeks. The forecasts evaluated were submitted from October 27, 2020 through January 4, 2021. Revision dates for these data were calculated as of 2021-01-12.
In this weekly report we evaluate forecasts made for 57 locations (the US at the national level, 50 states, and 6 territories), across 4 horizons and 10 submission weeks. We evaluate 3 targets: incident cases, incident deaths, and cumulative deaths.
In collaboration with the US CDC, our team collects COVID-19 forecasts from dozens of teams around the globe. Each Monday evening or Tuesday morning, we combine the most recent forecasts from each team into a single "ensemble" forecast for each of the target submissions.
Typically on Wednesday or Thursday of each week, a summary of the week's forecasts from the COVID-19 Forecast Hub, including the ensemble forecast, appears on the official CDC COVID-19 forecasting page.
This figure shows the number of incident cases reported each week. The period between the vertical lines marks the weeks over which models were evaluated.
The figure below shows the number of locations that each model submitted forecasts for during this evaluation period. The dates listed on the x-axis are the Saturday before the first horizon; this is the Saturday associated with the target submission week. If a model is submitted on a Tuesday through Friday, the Saturday listed occurs after the submission; if the model is submitted on a Sunday or Monday, the Saturday occurs before the submission date.
This figure shows the number of locations each model submitted a weekly incident case forecast for. The maximum number of locations is 57, which includes all 50 states, a national-level forecast, and 6 US territories.
The number of models that submitted forecasts for incident cases is 39. The number of models that submitted forecasts for all 10 weeks was 36. The number of teams that submitted forecasts for all 57 locations was 8.
Each week, we generate a leaderboard table to assess the interval coverage, relative weighted interval score (WIS), and relative mean absolute error (MAE) of each model. The data in this figure are aggregated across all submission weeks, locations, and horizons.
For inclusion in this table, a team must have submitted a model for at least 6 of the last 10 weeks. A model was counted if it included at least 25 locations and forecasts for the 1- through 4-week-ahead horizons.
Well-calibrated models should have a 50% prediction interval coverage level of 0.5 and a 95% coverage level of 0.95.
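As a minimal sketch of how empirical interval coverage can be checked (this is not the Hub's evaluation code, and the column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical data: each row is one forecast with its 50% and 95%
# prediction interval endpoints and the eventually observed value.
df = pd.DataFrame({
    "lower_50": [90, 100, 80],
    "upper_50": [110, 120, 95],
    "lower_95": [70, 85, 60],
    "upper_95": [130, 140, 115],
    "observed": [105, 125, 90],
})

def empirical_coverage(df, lower, upper):
    """Fraction of forecasts whose interval contains the observed value."""
    inside = (df["observed"] >= df[lower]) & (df["observed"] <= df[upper])
    return inside.mean()

cov50 = empirical_coverage(df, "lower_50", "upper_50")  # ideally near 0.5
cov95 = empirical_coverage(df, "lower_95", "upper_95")  # ideally near 0.95
```

A model whose 50% intervals cover far more than half of the observations is over-dispersed; one covering far less is over-confident.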
The relative WIS and relative MAE are calculated using a pairwise approach to account for variation in the difficulty of forecasting different weeks and locations. Models with a relative WIS or MAE lower than 1 are more accurate than the baseline, and models with a relative WIS greater than 1 are less accurate than the baseline. The code for this comparison can be found here.
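For reference, the WIS used here is the weighted interval score of Bracher et al. A minimal sketch of the computation (not the Hub's actual scoring code; the function names and example values are illustrative) might look like:

```python
def interval_score(lower, upper, alpha, y):
    """Interval score for a central (1 - alpha) prediction interval [lower, upper]
    given the observed value y: width plus a penalty for missing the observation."""
    score = upper - lower
    if y < lower:
        score += (2 / alpha) * (lower - y)
    elif y > upper:
        score += (2 / alpha) * (y - upper)
    return score

def weighted_interval_score(median, intervals, y):
    """intervals: list of (alpha, lower, upper) tuples.
    Weights are w_k = alpha_k / 2 with w_0 = 1/2 on the median absolute error,
    normalized by K + 1/2, following the standard WIS definition."""
    K = len(intervals)
    total = 0.5 * abs(y - median)
    for alpha, lower, upper in intervals:
        total += (alpha / 2) * interval_score(lower, upper, alpha, y)
    return total / (K + 0.5)
```

For example, a forecast with median 100, a 50% interval of [90, 110], and a 95% interval of [70, 130] is scored against an observed value of 125 as `weighted_interval_score(100, [(0.5, 90, 110), (0.05, 70, 130)], 125)`; lower scores are better.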
In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at a national level for each timepoint.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all locations for each submission week at a 1-week horizon; the second shows the same aggregation at a 4-week horizon.
To view specific teams, double click on the team names in the legend. To view a value on the plot, click on the point in the forecast of interest.
In this figure, the dotted black line represents the average 1 week ahead error. There is often larger variation in error for the 4 week horizon compared to the 1 week horizon.
We would expect a well-calibrated model to have a value of 80% in this plot.
We would expect a well-calibrated model to have a value of 80% in this plot. There is typically larger variation in error for the 4-week horizon compared to the 1-week horizon.
The following figure shows the scores of models aggregated by horizon and submission week. In this figure, we only include models that have submitted forecasts for all 4 horizons and all 10 submission weeks evaluated. The color scheme shows the WIS relative to the baseline. The only locations evaluated are the 50 states and the national level.
This plot shows the observed number of incident deaths over the evaluation period.
In the 10-week evaluation period, the evaluated Saturdays are 2020-11-07 through 2021-01-09. The number of models that submitted forecasts for incident deaths is 52. The number of models that submitted forecasts for all 10 weeks was 50. The number of teams that submitted forecasts for all 57 locations was 10.
The figure below shows the number of locations that each model submitted incident death forecasts for during this evaluation period. The dates listed on the x-axis are the Saturday before the first horizon; this is the Saturday associated with the target submission week. If a model is submitted on a Tuesday through Friday, the Saturday listed occurs after the submission; if the model is submitted on a Sunday or Monday, the Saturday occurs before the submission date.
Each week, we generate a leaderboard table to assess the interval coverage, relative weighted interval score (WIS), and relative mean absolute error (MAE) of each model. The data in this figure are aggregated across all submission weeks, locations, and horizons.
For inclusion in this table, a team must have submitted a model for at least 6 of the last 10 weeks. A model was counted if it included at least 25 locations and forecasts for the 1- through 4-week-ahead horizons.
Well-calibrated models should have a 50% prediction interval coverage level of 0.5 and a 95% coverage level of 0.95.
The relative WIS and relative MAE are calculated using a pairwise approach to account for variation in the difficulty of forecasting different weeks and locations. Models with a relative WIS or MAE lower than 1 are more accurate than the baseline, and models with a relative WIS greater than 1 are less accurate than the baseline at predicting the number of incident deaths. The code for this comparison can be found here.
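The pairwise tournament behind the relative scores can be sketched as follows. This is an illustrative simplification, not the Hub's implementation: the `scores` dictionary is fabricated, and in practice each pair of models is compared only on the forecasts they share.

```python
import numpy as np

# Hypothetical mean WIS values per submission week for three models.
scores = {
    "modelA":   [3.0, 4.0, 5.0],
    "modelB":   [6.0, 8.0, 10.0],
    "baseline": [6.0, 8.0, 10.0],
}

def pairwise_relative_skill(scores, baseline="baseline"):
    """For each model, take the ratio of its mean score to every other model's
    mean score, summarize those ratios with a geometric mean, then rescale so
    the baseline model sits at exactly 1."""
    models = list(scores)
    theta = {}
    for i in models:
        ratios = [np.mean(scores[i]) / np.mean(scores[j])
                  for j in models if j != i]
        theta[i] = np.exp(np.mean(np.log(ratios)))  # geometric mean of ratios
    return {m: theta[m] / theta[baseline] for m in models}

rel = pairwise_relative_skill(scores)
```

Because every model is compared against every other model, a team that only forecast easy weeks or easy locations is not unfairly advantaged over one that tackled harder ones.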
In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at a national level for each timepoint.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all locations for each submission week at a 1-week horizon; the second shows the same aggregation at a 4-week horizon.
To view specific teams, double click on the team names in the legend. To view a value on the plot, click on the point in the forecast of interest.
In this figure, the dotted black line represents the average 1 week ahead error. There is larger variation in error for the 4 week horizon compared to the 1 week horizon.
The black line represents 80% coverage.
The black line represents 80% coverage.
Finally, we have evaluated the locations for which teams had the lowest WIS scores. In this figure, models were included if they submitted forecasts for all submission weeks and all horizons. The WIS scores stratified by location are included in each box. The color scheme shows the WIS relative to the baseline.
This figure shows the number of cumulative deaths reported each week. The period between the vertical lines marks the weeks over which models were evaluated.
The figure below shows the number of locations that each model submitted forecasts for during this evaluation period. The dates listed on the x-axis are the Saturday before the first horizon; this is the Saturday associated with the target submission week. If a model is submitted on a Tuesday through Friday, the Saturday listed occurs after the submission; if the model is submitted on a Sunday or Monday, the Saturday occurs before the submission date.
The number of models that submitted forecasts for cumulative deaths is 52. The number of models that submitted forecasts for all 10 weeks was 51. The number of teams that submitted forecasts for all 57 locations was 11.
The figure below shows the number of locations and weeks that each team has submitted forecasts for.
Each week, we generate a leaderboard table to assess the interval coverage, relative weighted interval score (WIS), and relative mean absolute error (MAE) of each model. The data in this figure are aggregated across all submission weeks, locations, and horizons.
For inclusion in this table, a team must have submitted a model for at least 6 of the last 10 weeks. A model was counted if it included at least 25 locations and forecasts for the 1- through 4-week-ahead horizons.
Well-calibrated models should have a 50% prediction interval coverage level of 0.5 and a 95% coverage level of 0.95.
The relative WIS and relative MAE are calculated using a pairwise approach to account for variation in the difficulty of forecasting different weeks and locations. Models with a relative WIS or MAE lower than 1 are more accurate than the baseline, and models with a relative WIS greater than 1 are less accurate than the baseline. The code for this comparison can be found here.
In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at a national level for each timepoint.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all locations for each submission week at a 1-week horizon; the second shows the same aggregation at a 4-week horizon.
To view specific teams, double click on the team names in the legend. To view a value on the plot, click on the point in the forecast of interest.
In this figure, the dotted black line represents the average 1 week ahead error. There is larger variation in error for the 4 week horizon compared to the 1 week horizon.
The black line represents 80% coverage.
The black line represents 80% coverage.